ABSTRACT

Natural antisense transcripts (NATs), as one 
type of regulatory RNAs, occur prevalently in plant 
genomes and play significant roles in physiologic- 
al and pathological processes. Although their 
important biological functions have been reported 
widely, a comprehensive database is lacking up to 
now. Consequently, we constructed a plant NAT 
database (PlantNATsDB) involving approximately 2 
million NAT pairs in 69 plant species. GO annotation 
and high-throughput small RNA sequencing data 
currently available were integrated to investigate 
the biological function of NATs. PlantNATsDB 
provides various user-friendly web interfaces to fa- 
cilitate the presentation of NATs and an integrated, 
graphical network browser to display the complex 
networks formed by different NATs. Moreover, a 
'Gene Set Analysis' module based on GO annotation
was designed to identify statistically significantly
overrepresented GO categories in a specific
NAT network. PlantNATsDB is currently the most
comprehensive resource of NATs in the plant
kingdom and can serve as a reference database
for investigating the regulatory function of NATs.
PlantNATsDB is freely available at
http://bis.zju.edu.cn/pnatdb/.

INTRODUCTION 

Gene regulation at the RNA level has been progressively
shown to be more important and prevalent than previously
presumed (1,2). With the advances of high-throughput
experimental technologies and bioinformatics methods,
an explosion of recent findings underscores both the
predominance and complexity of regulatory RNA molecules
in eukaryotes, including the discovery of ubiquitous
regulatory short non-coding RNAs (ncRNAs) (3), 
including microRNAs (miRNAs), endogenous short 
interfering RNAs (siRNAs) and Piwi-interacting RNAs 
(piRNAs), and the functional long ncRNAs (1,4). 
Natural antisense transcripts (NATs), as a new member 
of regulatory RNAs, occur prevalently in prokaryote and 
eukaryote genomes, and play significant roles in physio- 
logical and/or pathological processes (5). NATs are a 
group of endogenous RNA molecules containing 
sequences that are complementary to other transcripts 
(5-7). This class of RNAs includes both protein- and 
non-coding transcripts. NATs can be grouped into two
categories, cis-NATs and trans-NATs, based on whether
they act in cis or trans. Cis-NAT pairs are transcribed
from opposing DNA strands at the same genomic locus
and vary in orientation and in the length of the
overlap between their perfectly complementary
regions, whereas trans-NAT pairs are transcribed from
different loci and are only partially complementary (5).
Although underlying mechanistic insights are largely 
unknown, NATs have been implicated in many aspects 
of gene regulation including genomic imprinting, tran- 
scriptional interference, RNA masking, RNA editing, 
RNA interference (RNAi) and translational regulation 
(5,7,8). However, since the discovery of the founding
example of cis-NATs, SRO5 and P5CDH, which are involved in
the regulation of salt tolerance through the RNAi pathway
in Arabidopsis (Arabidopsis thaliana) (9), more and more
examples of NATs have been shown to act together with
endogenous siRNAs (nat-siRNAs) from the overlapping
regions in both plant and animal species (10-16).
Moreover, deep sequencing of small RNAs (sRNAs) 
together with bioinformatics analysis reveals that the 
overlap portions of NATs are the hotspots for siRNA 
generation (12,13,17,18), further indicating that NATs
are an important biogenesis mechanism of endogenous 
siRNAs. These recent discoveries revealed the unexpect- 
ed complexity of the regulatory networks formed by 
NATs (17). 

Whole-genome searches based on computational 
analysis have identified thousands of NAT pairs in 
multiple eukaryotes. Thus, standardized applications or
databases are required for data description, deposition,
organization, parsing and analysis, and for functional
discovery through integration with other biological data.
To date, only a few freely available NAT databases exist,
one of which, NATsDB (19), comprises 10 animal species.
However, the existing databases focus mainly on cis-NATs
and none of them covers any plant species, although
both cis-NATs and trans-NATs have been reported in
several plant species including two model plants, the
monocot rice (Oryza sativa) (17,18,20) and the eudicot
Arabidopsis (18,21-23). Furthermore, the functional
annotation and graphical visualization of NATs are limited.



In the current analysis, we developed a genome- 
scale computational pipeline to identify NATs in 
plant species. A convenient database of plant NATs 
(PlantNATsDB) was constructed, which contains 69 
plant species and provides the most comprehensive data 
set to date. PlantNATsDB serves the plant research 
community by providing facilitated access to a huge 
amount of resources regarding the NATs as well as a 
variety of specific analysis tools including browsing, 
searching, viewing, downloading and so on. In addition, 
it integrates Gene Ontology (GO) annotation (24) and 
sRNA high-throughput sequencing data sets to evaluate
and investigate the function of NATs. Moreover, a
'Gene Set Analysis' module based on GO annotation
was implemented to identify statistically significantly
overrepresented GO categories in the complex
network formed by different NATs. PlantNATsDB
provides an information-rich and user-friendly interface
and an integrated, graphical network browser to facilitate
the mining of specific functional NAT pairs (Figure 1). Detailed
information is provided at the PlantNATsDB website
(http://bis.zju.edu.cn/pnatdb/).

Figure 1. System overview of the PlantNATsDB core framework. (A) Schematic presentation of the five different types of cis-NATs (natural antisense
transcripts) (i.e. Divergent, Convergent, Containing, Nearby head-to-head and Nearby tail-to-tail) and the trans-NATs predicted by PlantNATsDB.
The complementary regions are highlighted and linked with vertical lines. Sequences used for NAT prediction were retrieved from various public
databases, as detailed on the website. All NATs predicted by PlantNATsDB were deposited in MySQL relational databases. (B) Highlighted
features of PlantNATsDB, which integrates various data to evaluate the function of NATs. [B(a)] The 69 plant species currently available in
PlantNATsDB. [B(b)] Network formed by different NATs displayed in the integrated network browser, which is based on the Cytoscape Web
program (31). Note that this network can be edited and used for further analysis, such as 'Gene Set Analysis'. An example of the output for
'Small RNA Expression' of a NAT pair is shown in [B(c)] and for 'GO Annotation' in [B(d)]. Note that small RNAs are enriched in the
overlapped region and the two genes of the NAT pair share very similar GO annotation. [B(e)] An example of the output for 'Gene Set
Analysis' based on GO annotation. The enriched GO categories are listed in the table with the P-value indicating the significance of enrichment.
The number of genes in each GO category is indicated and shown in the pie chart. Additional functional modules, such as 'Browser', 'Searcher' and
'Viewer', are detailed on the PlantNATsDB website.

DATABASE CONSTRUCTION 

Data source 

Of the 69 plant species, 27 have genomic information. For 
these 27 genomically sequenced species, the annotated 
transcription units (TUs) used for NAT prediction and 
other annotation information were downloaded from the 
specific genome-sequencing projects. Based on the fact 
that pseudogenes and transposons can form NATs with 
protein-coding genes (10,11,17), all the pseudogenes and 
transposons were retained for NAT prediction. For the
remaining 42 plant species, tentative consensus sequences
(TCs), which provide putative genes with functional
annotation similar to TUs, were used for NAT prediction;
the TCs and their related information were downloaded
from The Gene Index Project (25).

sRNA high-throughput sequencing data sets for each
species were obtained from the Gene Expression
Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) (26).
All the sRNA data sets retrieved for this study
are summarized in Table 1.

Prediction of NAT pairs 

Prediction of NAT pairs was performed as previously
described (17,18,22). Specifically, the following criteria
were used to identify cis-NATs and trans-NATs,
respectively.



Table 1. Summary statistics of small RNA data sets in PlantNATsDB

No.    Species                         GEO data sets (a)
                                       Series    Samples
1      Arabidopsis thaliana            15        80
2      Arabidopsis lyrata              3         8
3      Brachypodium distachyon         2         4
4      Chlamydomonas reinhardtii       3         6
5      Citrus sinensis                 1         2
6      Gossypium hirsutum              2         6
7      Glycine max                     2         5
8      Medicago truncatula             2         5
9      Nicotiana benthamiana           2         6
10     Oryza sativa subsp. indica      1         2
11     Oryza sativa subsp. japonica    6         38
12     Physcomitrella patens           3         10
13     Prunus persica                  1         2
14     Solanum lycopersicum            1         2
15     Selaginella moellendorffii      1         1
16     Triticum aestivum               2         2
17     Vitis vinifera                  2         5
18     Zea mays                        4         12
Total                                  54        196

(a) Number of GEO Series or GEO Samples in each species, including
biological and technical replicates. Detailed information on the data sets
in each species is provided at the PlantNATsDB website.
Note that all small RNA data sets in this study were downloaded from
the GEO database (http://www.ncbi.nlm.nih.gov/geo/) (26).



Cis-NATs can be grouped into five categories,
namely: (i) Divergent (head-to-head or 5' to 5' overlap);
(ii) Convergent (tail-to-tail or 3' to 3' overlap); (iii)
Containing (full overlap); (iv) Nearby head-to-head
(5' close to 5') and (v) Nearby tail-to-tail (3' close to 3'),
according to their relative orientation and degree of
overlap (Figure 1A) (27). If a pair of transcripts was
located on opposite strands at adjacent genomic loci and
overlapped by at least 1 nt, or if their distance on the
chromosome was no more than 100 nt, they were considered
a cis-NAT pair. In total, 27 plant species were subjected to
cis-NAT prediction.
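
The cis-NAT criteria above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used to build PlantNATsDB: the `Transcript` record, the function name and the 1-based inclusive coordinates are assumptions, and "distance" is taken here to mean the number of nucleotides lying between the two transcripts.

```python
from dataclasses import dataclass
from typing import Optional

MAX_GAP_NT = 100  # "no more than 100 nt" apart, as stated in the text


@dataclass(frozen=True)
class Transcript:
    # Hypothetical record; coordinates are 1-based and inclusive, start <= end.
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def classify_cis_nat(a: Transcript, b: Transcript) -> Optional[str]:
    """Return the cis-NAT category of (a, b), or None if they do not qualify."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return None  # must lie on opposite strands of the same chromosome
    plus, minus = (a, b) if a.strand == "+" else (b, a)
    overlap = min(plus.end, minus.end) - max(plus.start, minus.start) + 1
    if overlap >= 1:
        plus_in_minus = minus.start <= plus.start and plus.end <= minus.end
        minus_in_plus = plus.start <= minus.start and minus.end <= plus.end
        if plus_in_minus or minus_in_plus:
            return "Containing"
        # The 3' end of a plus-strand transcript is its right end; the 3' end
        # of a minus-strand transcript is its left end.
        return "Convergent" if plus.start < minus.start else "Divergent"
    gap = -overlap  # nucleotides strictly between the two transcripts
    if gap > MAX_GAP_NT:
        return None
    # Plus-strand transcript on the left: the two 3' ends face each other.
    if plus.end < minus.start:
        return "Nearby tail-to-tail"
    return "Nearby head-to-head"
```

For example, a plus-strand transcript at 100-500 and a minus-strand transcript at 400-900 overlap at their 3' ends and would be labelled Convergent under these assumptions.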

For trans-NATs, BLASTN (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/,
Release 2.2.20) (28) was used to
search for transcript pairs with high sequence complementarity
to each other, and each transcript pair was classified
using the following criteria: (i) if the complementary
region identified by BLAST covered more than half the
length of either transcript, the transcript pair was
designated a 'high-coverage' (HC) trans-NAT pair;
(ii) if the two transcripts had a continuous complementary
region >100 nt, they were classified as a '100-nt' pair.
Functional trans-NATs should form RNA-RNA
duplexes in vivo. We therefore used DINAMelt (29) to
verify in silico whether the transcript pairs could hybridize
to form RNA-RNA duplexes in the complementary regions.
All the trans-NAT pairs from the BLAST
search were subjected to DINAMelt hybridization validation.
A trans-NAT pair was retained if (i)
the paired region identified by DINAMelt coincided
with that found by the BLAST-based search and (ii) any bubble in
the paired region predicted by DINAMelt was no
longer than 10% of the region. BLAST-based
trans-NAT pairs containing transcripts >10 kb
were not subjected to DINAMelt validation because of the
heavy computational cost. Instead, if the paired region
identified by BLAST was >10% of the longer transcript, the
pair was considered a verified trans-NAT.
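
As a rough illustration of the trans-NAT rules described above, the sketch below labels a candidate pair and applies the retention filter. It does not run BLAST or DINAMelt: the paired length, the flag saying whether the two paired regions coincide, and the longest bubble length are assumed to have been extracted from those tools beforehand, and all function and parameter names are hypothetical.

```python
from typing import List, Optional

LONG_TRANSCRIPT_NT = 10_000  # pairs with a transcript >10 kb skip hybridization


def classify_trans_nat(len_a: int, len_b: int, paired_len: int) -> List[str]:
    """Label a BLAST-derived pair as 'HC', '100-nt', both, or neither."""
    labels = []
    # Covering more than half of either transcript is equivalent to
    # covering more than half of the shorter one.
    if paired_len > 0.5 * min(len_a, len_b):
        labels.append("HC")
    if paired_len > 100:
        labels.append("100-nt")
    return labels


def passes_validation(
    len_a: int,
    len_b: int,
    paired_len: int,
    coincident: Optional[bool] = None,
    longest_bubble: Optional[int] = None,
) -> bool:
    """Apply the retention rules to one candidate pair.

    coincident     -- True if the hybridization-predicted paired region
                      matches the BLAST region (assumed precomputed)
    longest_bubble -- length of the longest unpaired bubble in that region
    """
    longer = max(len_a, len_b)
    if longer > LONG_TRANSCRIPT_NT:
        # Fallback for long transcripts: BLAST region must exceed 10% of
        # the longer transcript.
        return paired_len > 0.10 * longer
    if coincident is None or longest_bubble is None:
        raise ValueError("hybridization results are required for transcripts <= 10 kb")
    return coincident and longest_bubble <= 0.10 * paired_len
```

Returning a list of labels reflects one reading of the text, in which a pair may satisfy either criterion or both; the source does not state whether the two classes are mutually exclusive.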

All the NAT pairs predicted in this study are
summarized in Table 2.

Small RNA analysis 

sRNA sequences containing incomplete information
(such as 'N') or with length <18 or >28 nt were
removed before further analysis. For each data set, the
filtered sRNA sequences were mapped to all the gene
models of the corresponding plant species. All mapping steps
were performed using the Bowtie algorithm (30),
allowing no mismatches. In addition, for comparison, the
normalized abundance of sRNAs from each data set was
calculated as reads per million (RPM), i.e. the
read number of each sRNA divided by the total reads from the
data set and multiplied by 10^6.
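
The filtering and normalization steps can be illustrated with the minimal sketch below. The function names are hypothetical, and the total used as the denominator is assumed to be the number of reads that survive filtering; the text does not say whether the raw or the filtered total was used.

```python
from collections import Counter
from typing import Dict, Iterable


def keep_read(seq: str, min_len: int = 18, max_len: int = 28) -> bool:
    """Keep reads of 18-28 nt that contain no ambiguous base 'N'."""
    return min_len <= len(seq) <= max_len and "N" not in seq.upper()


def rpm_table(reads: Iterable[str]) -> Dict[str, float]:
    """Collapse identical reads and express each count as reads per million."""
    kept = [r.upper() for r in reads if keep_read(r)]
    total = len(kept)  # assumption: normalize by the filtered read total
    if total == 0:
        return {}
    counts = Counter(kept)
    return {seq: n * 1_000_000 / total for seq, n in counts.items()}
```

In a data set where a given 21-nt sequence accounts for three of four retained reads, its normalized abundance would be 750 000 RPM.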

For each NAT, an enrichment score was calculated to
evaluate whether sRNAs were enriched in the overlapping
region (17,18). The enrichment score E was calculated
using the following formula:

$$E = \frac{S_o / L_o}{S_a / L_a}$$

where $S_o$ = the total normalized abundance of the
sRNAs generated from the overlapping region, $L_o$ = the
total length of the paired region of the two
transcripts of the NATs, $S_a$ = the total normalized abundance
of the sRNAs generated from these two transcripts
and $L_a$ = the total length of the two transcripts.
Furthermore, a standard $\chi^2$ test (Pearson's chi-square
test) was performed to test the significance of the
enrichment.
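
A small worked sketch of the enrichment score follows. The score itself is taken directly from the definition above. The significance test, however, is only one plausible setup: the text does not describe how the chi-square test was laid out, so the sketch assumes a one-degree-of-freedom goodness-of-fit test on raw read counts, with the expected share of reads in the overlap proportional to its length. Function names are hypothetical.

```python
import math
from typing import Tuple


def enrichment_score(s_o: float, l_o: float, s_a: float, l_a: float) -> float:
    """E = (S_o / L_o) / (S_a / L_a): sRNA density in the overlap vs. overall."""
    return (s_o / l_o) / (s_a / l_a)


def chi_square_enrichment(
    n_overlap: int, n_total: int, l_o: float, l_a: float
) -> Tuple[float, float]:
    """Goodness-of-fit test (1 df) of reads inside vs. outside the overlap.

    Assumption: under the null hypothesis reads fall uniformly along the
    transcripts, so the expected count in the overlap is n_total * l_o / l_a.
    Requires 0 < l_o < l_a and n_total > 0.
    """
    exp_in = n_total * l_o / l_a
    exp_out = n_total - exp_in
    obs_out = n_total - n_overlap
    stat = (n_overlap - exp_in) ** 2 / exp_in + (obs_out - exp_out) ** 2 / exp_out
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

A score above 1 means that sRNAs are denser in the overlapping region than across the two transcripts as a whole; the test then indicates whether that excess is larger than would be expected from length alone.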

Database implementation 

All the predicted NATs and processed sRNAs were
organized and stored in a MySQL database (http://www.mysql.com/).
In addition, the gene sequence information,
annotated gene models and their functional annotations,
including GO annotations, were collected and
stored in the database. These genes can also be linked
to external genome browsers. PlantNATsDB was
implemented in JSP and deployed on the
Apache Tomcat web server (http://tomcat.apache.org/).
The integrated network browser was built with the Cytoscape
Web program (http://cytoscapeweb.cytoscape.org/) (31).
JavaScript and Adobe Flash Player are required in
order to use the full functionality of PlantNATsDB.
PlantNATsDB can be accessed through IE 6.0 or higher, 
Netscape 7.0 or higher, Safari, Opera, Chrome and 
Firefox from multiple platforms. 

WEB INTERFACE AND DATABASE USAGE 

Search modules 

PlantNATsDB provides various query interfaces and
graphical visualization tools to facilitate the retrieval and
display of NAT data. Four major search modules
for retrieving NATs are designed: 'Simple Searcher', 
'Batched Searcher', 'Advanced Searcher' and 'BLAST 
Searcher'. Alternatively, users can get the entire NAT 
list by species in the 'Browser' module. The 'Simple 
Searcher' module allows users to enter any keyword in 
all fields for all data entries, including gene locus identi- 
fiers (IDs), gene aliases or any words in their annotation 
texts. The 'Batched Searcher' module supports gene set 
search, which allows users to enter a list of gene locus 
IDs or gene aliases. The 'Advanced Searcher' was
designed to let users access NAT data according
to multiple options such as the plant species, the
types of NATs, the length of overlapping regions and
the GO annotation. In addition, users can perform a
BLAST sequence search to retrieve NAT data in the
fourth module, 'BLAST Searcher'. All the search results
returned by the above search modules can be further
used for functional investigation (see below).

NAT information page 

For each NAT pair, PlantNATsDB provides rich annotation
based on the relationship between the two
genes involved. The result page comprises four main parts,
i.e. NAT summarization, gene information, GO annotation
and sRNA expression. In general, all parts are
displayed graphically. Figure 2
shows the example of the SRO5 and P5CDH cis-NAT pair
(9). The first part is the summary of NAT information, in which
the overlapping region is highlighted (Figure 2A). The
second part shows the detailed annotation of the two
genes (Figure 2B). The third part displays the GO functional
assessment of this NAT pair based on the GO annotation
of the two genes (Figure 2C). Functional NAT
pairs are expected to have similar 'Molecular Function',
to be involved in related 'Biological Process' and/or to be
located in the same 'Cellular Component'. Therefore, the GO
terms shared by the two genes are highlighted in the GO
network graph. The information provided in this part is
useful for evaluating the function of a NAT. The last
part provides the expression of sRNAs derived from the NAT
pair (Figure 2D). Because sRNAs are
an important component of the NAT regulatory pathway
(9), most, if not all, of the sRNA data sets currently available
were collected, processed and organized
into the database (Table 1). These data
should help users to investigate the
function of NATs. Furthermore, a user-friendly interface
is provided that allows users to add or remove data sets
for analysis and to highlight different regions of the NAT.

GO functional analysis module 

Gene set analysis based on GO annotation (24) and statistical
testing is widely used to identify enriched GO
categories and to explore the most important biological
terms associated with a given gene set. A 'Gene Set
Analysis' module (Figure 1B) has been developed for
analyzing a set of genes based on GO annotation,
where the set of genes can be obtained from the search
modules (see above) or collected from the network formed
by NAT pairs (see below). Here, we used a combination
of the $\chi^2$ test and Fisher's exact test to evaluate the
significance of enrichment for each GO category. Detailed
methods are available on the PlantNATsDB website.
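
To make the enrichment test concrete, the sketch below computes a one-sided Fisher's exact (hypergeometric) P-value for a single GO category. It illustrates only the Fisher component; how PlantNATsDB combines it with the chi-square test is documented on the website and is not reproduced here. The function and parameter names are assumptions.

```python
from math import comb


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided P-value for over-representation of a GO category.

    N -- annotated genes in the background
    K -- background genes annotated with the category
    n -- genes in the query set (e.g. genes in a NAT network)
    k -- query genes annotated with the category

    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)
```

With a background of 10 genes of which 4 carry the term, drawing 3 genes and observing at least 2 with the term has probability (36 + 4) / 120 = 1/3, so such a small example would not be called enriched.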

Figure 2. The information page of the NAT pair formed by SRO5 (AT5G62520) and P5CDH (AT5G62530) (9). (A) Summary of the NAT information,
including the type, sequences and length of the overlapped region. The sequence of the overlapped region is highlighted below. (B) Detailed
annotation of the two genes of this cis-NAT pair. (C) GO functional annotation of the two genes. The annotated GO terms are displayed in a Venn
diagram and a GO network graph. The GO network graph contains two types of nodes: those that represent the NAT pairs (triangle nodes) and those
that represent GO hierarchical terms (circle nodes). The shared GO terms (red) and gene-specific GO terms (purple and green) are shown in
different colors. Functional similarity of these two genes is represented by the percentage of shared GO terms. (D) The expression of the small RNAs
derived from the NAT pair. Small RNAs from different data sets are indicated by dots in different colors. The overlapped region is highlighted in
the chart. Small RNA data sets can be added to or removed from the chart by clicking the buttons below. The enrichment score for
small RNAs generated from the overlapped region is calculated based on the specific data sets. Note that no enrichment of
small RNAs derived from the overlapped region is observed because this NAT pair is formed specifically under salt stress and PlantNATsDB lacks such
data sets.

Graphical interaction network visualization

One gene may form multiple NAT pairs with other antisense
transcription partners, just as multiple paralogous
genes may form RNA duplexes with the same antisense
transcripts. Different NAT pairs might thus form complex
regulatory networks in related processes (17). To this
end, a graphical browser based on the Cytoscape Web
program (31) was developed to display the network
formed by different NAT pairs (Figure 1B). Different
types of nodes (genes) and relationships (NAT pairs) are
colored distinctly. Moreover, the network graph can be
edited (e.g. by clicking, double-clicking or right-clicking
nodes or edges, deleting nodes or edges, applying distinct
layouts and exporting the graph in various formats), and all
the genes contained in the network can be further subjected
to gene set analysis based on GO annotation (see above).
In addition, users can use the 'My Network' toolkit,
in which genes or NAT pairs of interest are stored temporarily
on the server side during the session and
later retrieved on the 'My Network' page. A button
to add selected genes or NAT pairs to 'My Network' appears on
many pages of the website, which helps
users to extract specific biological networks formed by
related NAT pairs involved in the regulation of
interrelated processes.
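
The idea that NAT pairs link genes into networks can be sketched as a simple undirected graph, with genes as nodes and NAT pairs as edges. The code below is an illustrative sketch and is unrelated to the Cytoscape Web implementation; the gene identifiers in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_network(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (gene, gene) NAT pairs."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Return the sets of genes that are linked, directly or indirectly."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for node in graph:
        if node in seen:
            continue
        component: Set[str] = set()
        stack = [node]
        while stack:  # iterative depth-first traversal
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        components.append(component)
    return components
```

Given the hypothetical pairs (g1, g2), (g2, g3) and (g4, g5), the genes g1, g2 and g3 fall into one subnetwork because g2 pairs with both partners, while g4 and g5 form a second one. Each such gene set is the kind of input that could then be passed to a GO-based gene set analysis.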

SUMMARY AND FUTURE DIRECTIONS 

This work presents a comprehensive collection of plant
NATs, organized and deposited in an online
database named PlantNATsDB. The biological function
of NAT pairs can be elucidated from the various
integrated data currently available. Moreover, graphical web
interfaces are provided to facilitate the presentation of
NATs. PlantNATsDB serves the plant research community
by providing a reference database to investigate the
functions of NATs.

In the near future, PlantNATsDB will collect and
include more experimentally validated data and will
distinguish between experimentally determined and
predicted NATs. In addition, more useful and precise
algorithms or tools will be designed to evaluate the functions
of NAT pairs or to identify functional NAT pairs
based on GO network graphs and NAT-formed regulatory
networks. For example, it would be helpful to place such
a regulatory subnetwork graph in the context of a larger
network. Moreover, some NAT pairs or subnetworks
formed by NATs may be conserved between species.
PlantNATsDB intends to allow users to select a specific
family and to make comparisons among the family
members.

As new and improved high-throughput technologies are
applied to a broader set of species, cell lines, tissues and
conditions, more and more data sets will be generated.
PlantNATsDB will be continuously maintained and
updated in a timely manner to keep up with these improvements.
Furthermore, gene expression data, such as expressed
sequence tags (ESTs), microarray and RNA-Seq data, and
degradome-sequencing data (32,33) will be integrated
into PlantNATsDB to improve our understanding of the
regulatory networks formed by NATs.